Query-based transformers have shown great potential for constructing long-range attention in many image-domain tasks, but have rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of point clouds. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the features of the center candidates as the query embeddings in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding boxes on the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN- and transformer-based methods. Our code is publicly available at https://github.com/tusimple/centerformer
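The center-candidate selection step described above can be sketched as follows. This is a minimal, framework-free illustration of picking top-scoring heatmap peaks and gathering their features as transformer queries; the function name, array shapes, and `k` are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def select_center_queries(heatmap, features, k):
    """Pick the top-k heatmap peaks and gather their feature vectors.

    heatmap:  (H, W) center score map
    features: (H, W, C) BEV feature map from the voxel encoder
    returns:  (k, C) query embeddings and their (row, col) positions
    """
    flat = heatmap.ravel()
    top = np.argsort(flat)[::-1][:k]           # indices of the k highest scores
    rows, cols = np.unravel_index(top, heatmap.shape)
    queries = features[rows, cols]             # (k, C) center-candidate features
    return queries, list(zip(rows.tolist(), cols.tolist()))
```

Because only k candidate locations become queries (rather than every BEV cell), the transformer's attention cost is decoupled from the full point cloud size, which is the efficiency argument the abstract makes.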
Different from other cell-based NAS methods, Broad Neural Architecture Search (BNAS) proposed a broad search space consisting of convolution and enhancement blocks, dubbed the Broad Convolutional Neural Network (BCNN), for remarkable efficiency improvement. BCNN reuses the topologies of cells in the convolution block so that BNAS can employ few cells for an efficient search. Moreover, multi-scale feature fusion and knowledge embedding were proposed to improve the performance of BCNN with its shallow topology. However, BNAS suffers from some drawbacks: 1) insufficient representation diversity for feature fusion and enhancement, and 2) time-consuming knowledge embedding design by human experts. In this paper, we propose Stacked BNAS, whose search space is a developed broad scalable architecture named Stacked BCNN, with better performance than BNAS. On the one hand, Stacked BCNN treats a mini-BCNN as the basic block to preserve comprehensive representations and deliver powerful feature extraction ability. On the other hand, we propose Knowledge Embedding Search (KES) to learn appropriate knowledge embeddings. Experimental results show that 1) Stacked BNAS obtains better performance than BNAS, 2) KES helps reduce the parameters of the learned architecture with satisfactory performance, and 3) Stacked BNAS delivers a state-of-the-art efficiency of 0.02 GPU days.
Guided depth super-resolution (GDSR) is an essential topic in multi-modal image processing, which reconstructs high-resolution (HR) depth maps from low-resolution ones captured under suboptimal conditions, with the help of HR RGB images of the same scene. To address the challenges of interpreting the working mechanism, extracting cross-modal features, and avoiding over-transferred RGB texture, we propose a novel Discrete Cosine Transform Network (DCTNet) to alleviate these problems from three aspects. First, the Discrete Cosine Transform (DCT) module reconstructs multi-channel HR depth features by using the DCT to solve the channel-wise optimization problem derived from the image domain. Second, we introduce a semi-coupled feature extraction module that uses shared convolutional kernels to extract common features and private kernels to extract modality-specific features. Third, we employ an edge attention mechanism to highlight the contours that are informative for guided upsampling. Extensive quantitative and qualitative evaluations demonstrate the effectiveness of our DCTNet, which outperforms previous state-of-the-art methods with relatively fewer parameters. The code will be made publicly available.
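The DCT at the heart of the module above is standard and can be written out compactly. The sketch below builds the orthonormal DCT-II basis matrix and applies it separably to a 2-D map; it shows only the transform itself, not DCTNet's channel-wise optimization, and the function names are my own:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix C, so that C @ x gives the 1-D DCT of x."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)    # DC row gets the 1/sqrt(n) scale
    return c

def dct2(x):
    """2-D DCT of a square map: transform rows, then columns."""
    c = dct_matrix(x.shape[0])
    return c @ x @ c.T
```

Because C is orthonormal (C.T @ C = I), the transform is exactly invertible, which is what lets a network solve an optimization problem in the frequency domain and map the result back to the image domain without loss.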
Pansharpening is a widely used image enhancement technique in remote sensing. Its principle is to fuse an input high-resolution single-channel panchromatic (PAN) image and a low-resolution multi-spectral image to obtain a high-resolution multi-spectral (HRMS) image. Existing deep learning pansharpening methods have two shortcomings. First, the features of the two input images need to be concatenated along the channel dimension to reconstruct the HRMS image, which fails to make the importance of the PAN image prominent and also leads to high computational cost. Second, the implicit information in the features is difficult to extract through manually designed loss functions. To this end, we propose a generative adversarial network via the fast guided filter (FGF) for pansharpening. In the generator, the traditional channel concatenation is replaced by the FGF to better retain spatial information while reducing the number of parameters. Meanwhile, the fusion targets can be highlighted by a spatial attention module. In addition, the latent information in the features can be preserved effectively through adversarial training. Extensive experiments demonstrate that our network generates high-quality HRMS images that surpass existing methods, with fewer parameters.
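The guided filtering that replaces channel concatenation above has a simple closed form. Below is a minimal 1-D sketch of the classic guided filter (the fast variant in the paper additionally subsamples, and the standard formulation also box-averages the coefficients a and b, both omitted here for brevity); all names and the box-mean edge handling are illustrative assumptions:

```python
import numpy as np

def box_mean(x, r):
    """Uniform box mean of radius r (edge windows are zero-padded)."""
    k = 2 * r + 1
    return np.convolve(x, np.ones(k) / k, mode="same")

def guided_filter_1d(guide, src, r=2, eps=1e-6):
    """Minimal 1-D guided filter: q = a * guide + b, with a, b from local statistics."""
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    corr = box_mean(guide * src, r)
    var = box_mean(guide * guide, r) - mean_i ** 2
    a = (corr - mean_i * mean_p) / (var + eps)   # edge-aware gain
    b = mean_p - a * mean_i                      # offset
    return a * guide + b
```

The key property for pansharpening is that the output is locally a linear function of the guide (here, the PAN image), so the guide's spatial structure transfers to the filtered spectral features instead of being diluted by concatenation.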
Generating controllable and editable human motion sequences is a key challenge in 3D avatar generation. Generating and animating human motion had long been labor-intensive until learning-based approaches were developed and applied recently. However, these approaches are still task-specific or modality-specific\cite{ahuja2019language2pose}\cite{ghosh2021synthesis}\cite{ferreira2021learning}\cite{li2021ai}. In this paper, we propose ``UDE'', the first unified driving engine that enables generating human motion sequences from natural language or audio sequences (see Fig.~\ref{fig:teaser}). Specifically, UDE consists of the following key components: 1) a motion quantization module based on VQVAE that represents a continuous motion sequence as discrete latent codes\cite{van2017neural}, 2) a modality-agnostic transformer encoder\cite{vaswani2017attention} that learns to map modality-aware driving signals to a joint space, 3) a unified token transformer (GPT-like\cite{radford2019language}) network to predict the quantized latent code indices in an auto-regressive manner, and 4) a diffusion motion decoder that takes the motion tokens as input and decodes them into motion sequences with high diversity. We evaluate our method on the HumanML3D\cite{Guo_2022_CVPR} and AIST++\cite{li2021learn} benchmarks, and the experimental results demonstrate that our method achieves state-of-the-art performance. Project website: \url{https://github.com/zixiangzhou916/UDE/}
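The VQ-VAE quantization in component 1) reduces to a nearest-neighbor lookup against a learned codebook. A minimal sketch of that lookup (array shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    z:        (T, D) encoder outputs for T motion frames
    codebook: (K, D) learned embedding table
    returns:  (T,) discrete token indices and the (T, D) quantized vectors
    """
    # squared distances between every latent and every codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]
```

The resulting index sequence is exactly what lets the GPT-like transformer in component 3) treat motion as tokens and predict it auto-regressively.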
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model, dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge to the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities to the multi-level teacher models. SKD has two advantages: 1) it preserves the diversity of the multi-level teacher models by stochastically sampling a single teacher model in each iteration, and 2) it improves the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language-understanding performance and being 100% faster.
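The per-iteration teacher sampling can be sketched in a few lines; the teacher names, probabilities, and function below are hypothetical stand-ins for the ensemble and the paper's actual sampling distributions:

```python
import random

def distill_schedule(teachers, probs, steps, seed=0):
    """For each training iteration, sample one teacher from the ensemble.

    teachers: list of teacher identifiers (ordered by capacity)
    probs:    sampling distribution over the teachers (sums to 1)
    returns:  list of length `steps` with the teacher chosen at each step
    """
    rng = random.Random(seed)
    return [rng.choices(teachers, weights=probs, k=1)[0] for _ in range(steps)]
```

Because only one teacher runs per step, each iteration costs the same as ordinary one-to-one distillation, while over training the student still sees the whole multi-level ensemble.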
Achieving accurate and automated tumor segmentation plays an important role in both clinical practice and radiomics research. Segmentation in medicine is now often performed manually by experts, which is a laborious, expensive and error-prone task. Manual annotation relies heavily on the experience and knowledge of these experts. In addition, there is considerable intra- and inter-observer variation. Therefore, it is of great significance to develop a method that can automatically segment tumor target regions. In this paper, we propose a deep learning segmentation method based on multimodal positron emission tomography-computed tomography (PET-CT), which combines the high sensitivity of PET and the precise anatomical information of CT. We design an improved spatial attention network (ISA-Net) to increase the accuracy of PET or CT in detecting tumors, which uses multi-scale convolution operations to extract feature information, highlighting tumor-region location information and suppressing non-tumor-region location information. In addition, our network uses dual-channel inputs in the encoding stage and fuses them in the decoding stage, which can take advantage of the differences and complementarities between PET and CT. We validated the proposed ISA-Net method on two clinical datasets, a soft tissue sarcoma (STS) dataset and a head and neck tumor (HECKTOR) dataset, and compared it with other attention methods for tumor segmentation. The DSC scores of 0.8378 on the STS dataset and 0.8076 on the HECKTOR dataset show that the ISA-Net method achieves better segmentation performance and generalizes better. Conclusions: The method proposed in this paper performs multi-modal medical image tumor segmentation and can effectively utilize the differences and complementarities of different modalities. With proper adjustment, it can also be applied to other multi-modal or single-modal data.
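The core idea of spatial attention, i.e., weighting each location by a gate built from channel-pooled descriptors, can be sketched as below. Note this is a generic pooled-gate sketch, not ISA-Net itself, which additionally uses multi-scale convolutions to build the gate:

```python
import numpy as np

def spatial_attention(x):
    """Weight each spatial location by a sigmoid gate built from channel pooling.

    x: (C, H, W) fused feature map
    returns: gated map of the same shape
    """
    avg = x.mean(axis=0)                        # (H, W) channel-average descriptor
    mx = x.max(axis=0)                          # (H, W) channel-max descriptor
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))    # sigmoid over the combined map
    return x * gate[None, :, :]
```

Locations with strong responses across channels (likely tumor regions) get gate values near 1, while weak locations are suppressed toward 0, which matches the highlight/suppress behavior described in the abstract.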
LiDAR-based 3D object detection, semantic segmentation, and panoptic segmentation are usually implemented in specialized networks with distinctive architectures that are difficult to adapt to one another. This paper presents LidarMultiNet, a LiDAR-based multi-task network that unifies these three major LiDAR perception tasks. Among its many benefits, a multi-task network can reduce the overall cost by sharing weights and computation among the multiple tasks. However, it typically underperforms independently combined single-task models. The proposed LidarMultiNet aims to bridge the performance gap between a multi-task network and multiple single-task networks. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder architecture with a Global Context Pooling (GCP) module that extracts global contextual features from a LiDAR frame. Task-specific heads are added on top of the network to perform the three LiDAR perception tasks. More tasks can be implemented simply by adding new task-specific heads, introducing little additional cost. A second stage is also proposed to refine the first-stage segmentation and generate accurate panoptic segmentation results. LidarMultiNet has been extensively tested on both the Waymo Open Dataset and the nuScenes dataset, demonstrating for the first time that the major LiDAR perception tasks can be unified in a single strong network that is trained end-to-end and achieves state-of-the-art performance. Notably, LidarMultiNet reached the highest mIoU and the best accuracy for most of the 22 classes on the test set of the Waymo Open Dataset 3D Semantic Segmentation Challenge 2022, using only LiDAR points as input. It also sets a new single-model state of the art on the Waymo 3D object detection benchmark and three nuScenes benchmarks.
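The pool-broadcast-concatenate idea behind global context pooling can be sketched in a few lines. This is only the skeleton of the mechanism; the actual GCP module operates on dense BEV feature maps with convolutional layers, and the names and shapes below are illustrative:

```python
import numpy as np

def global_context_pooling(local_feats):
    """Concatenate each voxel's local feature with a globally pooled context vector.

    local_feats: (N, C) features of N occupied voxels
    returns:     (N, 2C) features with global context appended
    """
    ctx = local_feats.max(axis=0)                    # (C,) scene-level descriptor
    ctx = np.broadcast_to(ctx, local_feats.shape)    # repeat for every voxel
    return np.concatenate([local_feats, ctx], axis=1)
```

After this step every voxel sees a summary of the whole frame, which is what lets detection, semantic, and panoptic heads share one backbone without each needing its own global receptive field.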
The sparsely activated Mixture-of-Experts (MoE) layer, controlled by a router, has achieved great success in deep learning. However, the understanding of this architecture remains elusive. In this paper, we formally study how the MoE layer improves the performance of neural network learning and why the mixture model does not collapse into a single model. Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the experts are pivotal to the success of MoE. To further understand this, we consider a challenging classification problem with intrinsic cluster structure, which is hard to learn using a single expert. Yet with an MoE layer, choosing the experts to be two-layer nonlinear convolutional neural networks (CNNs), we show that the problem can be learned successfully. Furthermore, our theory shows that the router can learn the cluster-center features, which helps divide the complex input problem into simpler linear classification sub-problems that individual experts can conquer. To our knowledge, this is the first result toward formally understanding the mechanism of the MoE layer in deep learning.
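The routing mechanism under study can be sketched with hard top-1 gating: the router scores every expert and each input is handled only by its highest-scoring expert. This is a generic MoE-forward sketch under assumed shapes, not the paper's theoretical construction:

```python
import numpy as np

def moe_forward(x, router_w, experts):
    """Route each input to the expert with the highest gating score (top-1).

    x:        (N, D) inputs
    router_w: (D, E) router weights producing one score per expert
    experts:  list of E callables, each mapping a (D,) input to an output
    """
    logits = x @ router_w              # (N, E) gating scores
    choice = logits.argmax(axis=1)     # hard top-1 routing
    return [experts[e](xi) for xi, e in zip(x, choice)], choice.tolist()
```

If the router weights align with the cluster centers, inputs from the same cluster are consistently routed to the same expert, which is exactly the division into per-cluster sub-problems the abstract describes.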
This technical report presents the first-place winning solution for the Waymo Open Dataset 3D Semantic Segmentation Challenge 2022. Our network, termed LidarMultiNet, unifies the major LiDAR perception tasks of 3D semantic segmentation, object detection, and panoptic segmentation in a single framework. At the core of LidarMultiNet is a strong 3D voxel-based encoder-decoder network with a novel Global Context Pooling (GCP) module that extracts global contextual features from a LiDAR frame to complement its local features. An optional second stage is proposed to refine the first-stage segmentation or generate accurate panoptic segmentation results. Our solution achieves an mIoU of 71.13 and is the best for most of the 22 classes on the Waymo 3D Semantic Segmentation test set, outperforming all other 3D semantic segmentation methods on the official leaderboard. We demonstrate for the first time that the major LiDAR perception tasks can be unified in a single strong network that can be trained end-to-end.